7.4 Compression

85

greatest for files with a lot of repetitive material, but according to van der Waerden’s

(1927) extension of Baudet’s conjecture, any string of two kinds of symbols has

repetitive sequences of at least one of the symbols.

7.4.1

Use of Compression to Measure Distance

Suppose two ergodic binary sources upper PP and upper QQ emit 1s with probabilities pp and qq,

respectively. The Kullback–Leibler (1951) relative entropy between the two strings

is

upper S Subscript upper P upper Q Baseline equals minus q log Subscript 2 Baseline StartFraction p Over q EndFraction minus left parenthesis 1 minus q right parenthesis log Subscript 2 Baseline StartFraction 1 minus p Over 1 minus q EndFractionSP Q = −q log2

p

q (1q) log2

1p

1q

(7.8)

and may be used as the basis of a measure of distance between the two strings.

Benedetto et al. (2002) have devised an ingenious method for estimating upper S Subscript upper P upper QSP Q from

two sources by zipping a long string from each source (upper PP andupper QQ), to each of which,

prior to zipping, is appended a sufficiently short string fragment (say upper P primeP,) from one

of the sources. upper S Subscript upper P upper QSP Q is then the difference in coding efficiency between upper P primeP, coded

optimally because it follows upper PP (the source is ergodic) and upper P primeP, coded suboptimally

because it follows upper QQ. Using upper LL to denote the length of a zipped file,

upper S Subscript upper P upper Q Baseline equals left bracket left parenthesis upper L Subscript upper Q plus upper P prime Baseline minus upper L Subscript upper Q Baseline right parenthesis minus left parenthesis upper L Subscript upper P plus upper P prime Baseline minus upper L Subscript upper P Baseline right parenthesis right bracket divided by upper L prime Subscript upper P primeSP Q = [(L Q+P,L Q)(L P+P,L P)]/L,

P,

(7.9)

(in bits per character), whereupper L prime Subscript upper P primeL,

P, is the unzipped length of the short string fragment

upper P primeP,. In order to eliminate dependency on the particular coding, a different normaliza-

tion may be used:

upper S Subscript upper P upper Q Baseline equals StartFraction left parenthesis upper L Subscript upper Q plus upper P Sub Superscript prime Subscript Baseline minus upper L Subscript upper Q Baseline right parenthesis minus left parenthesis upper L Subscript upper P plus upper P Sub Superscript prime Subscript Baseline minus upper L Subscript upper P Baseline right parenthesis Over upper L Subscript upper P plus upper P Sub Superscript prime Subscript Baseline EndFraction plus StartFraction left parenthesis upper L Subscript upper P plus upper Q Sub Superscript prime Subscript Baseline minus upper L Subscript upper P Baseline right parenthesis minus left parenthesis upper L Subscript upper Q plus upper Q Sub Superscript prime Subscript Baseline minus upper L Subscript upper Q Baseline right parenthesis Over upper L Subscript upper Q plus upper Q Sub Superscript prime Subscript Baseline EndFraction periodSP Q = (L Q+P,L Q)(L P+P,L P)

L P+P,

+ (L P+Q,L P)(L Q+Q,L Q)

L Q+Q,

.

(7.10)

7.4.2

Ergodicity

Ergodicity means that every allowable point in phase space is visited infinitely often

in infinite time or, in practice, every allowable point in phase space is approached

arbitrarily closely after a long time. Ergodicity is a pillar of Boltzmann’s assumption

that the microstates of an ensemble have equal a priori probabilities, and indeed of

the rest of statistical mechanics. Nevertheless, as our knowledge of the world has

increased, it has become apparent that ergodicity actually applies only to a small

minority of natural systems. Although some systems may not even be ergodic in

the infinite time limit, most observed departures from ergodicity occur because of

the inordinately long times that would be required to fulfil it. The departures are